# A tibble: 5 × 9
id cesd age sex subst mcs pcs pss_fr cesd_hi
<chr> <int> <int> <fct> <fct> <dbl> <dbl> <int> <fct>
1 1 49 37 male cocaine 25.1 58.4 0 1
2 2 30 37 male alcohol 26.7 36.0 1 1
3 3 39 26 male heroin 6.76 74.8 13 1
4 4 15 39 female heroin 44.0 61.9 11 0
5 5 39 32 male cocaine 21.7 37.3 10 1
Can we use pcs to predict cesd?
Does the loess smooth match up well with the linear fit?
ggplot(help1, aes(x = pcs, y = cesd)) +
  geom_point(size = 2) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, col = "red") +
  labs(title = "Linear and Loess Fits for `cesd` vs. `pcs`")
Can we use pcs to predict cesd?
A simple linear regression: fitA
dd <- datadist(help1); options(datadist = "dd")
fitA <- ols(cesd ~ pcs, data = help1, x = TRUE, y = TRUE)
fitA$coefficients
Intercept pcs
49.1673458 -0.3396495
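To read these coefficients: each one-point increase in pcs is associated with a drop of about 0.34 points in predicted cesd. A small sketch of using them, assuming fitA has been fit as above:

```r
# Using fitA's coefficients by hand to predict cesd for a subject with pcs = 50
b0 <- 49.1673458
b1 <- -0.3396495
b0 + b1 * 50          # predicted cesd is about 32.2
# equivalently: predict(fitA, data.frame(pcs = 50))
```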
Our simple linear regression
Linear Regression Model
ols(formula = cesd ~ pcs, data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 40.57 R2 0.086
sigma 11.9796 d.f. 1 R2 adj 0.084
d.f. 451 Pr(> chi2) 0.0000 g 4.177
Residuals
Min 1Q Median 3Q Max
-28.4116 -7.8036 0.6846 8.7917 29.3281
Coef S.E. t Pr(>|t|)
Intercept 49.1673 2.5728 19.11 <0.0001
pcs -0.3396 0.0522 -6.50 <0.0001
Effect Sizes in fitA
ggplot(Predict(fitA, conf.int = 0.90))
plot(nomogram(fitA))
Using ols to fit a larger model
dd <- datadist(help1)
options(datadist = "dd")
fitB <- ols(cesd ~ pcs + subst + pss_fr + sex, data = help1, x = TRUE, y = TRUE)
fitB$coefficients
Fitting a model that includes powers of a predictor (here, pcs) creates a polynomial regression.
Plotting the Polynomials
p1 <- ggplot(help1, aes(x = pcs, y = cesd)) +
  geom_point(alpha = 0.3) +
  geom_smooth(formula = y ~ x, method = "lm", col = "red", se = FALSE) +
  labs(title = "Linear Fit")
p2 <- ggplot(help1, aes(x = pcs, y = cesd)) +
  geom_point(alpha = 0.3) +
  geom_smooth(formula = y ~ poly(x, 2), method = "lm",
              col = "blue", se = FALSE) +
  labs(title = "2nd order Polynomial")
p3 <- ggplot(help1, aes(x = pcs, y = cesd)) +
  geom_point(alpha = 0.3) +
  geom_smooth(formula = y ~ poly(x, 3), method = "lm",
              col = "purple", se = FALSE) +
  labs(title = "3rd order Polynomial")
p1 + p2 + p3
Plotting the Polynomials
Adding a polynomial in pcs
Can we predict cesd with a polynomial in pcs?
Yes, with ols() and pol(), as follows:
fitA <- ols(cesd ~ pcs, data = help1, x = TRUE, y = TRUE)
fitA_2 <- ols(cesd ~ pol(pcs, 2), data = help1, x = TRUE, y = TRUE)
fitA_3 <- ols(cesd ~ pol(pcs, 3), data = help1, x = TRUE, y = TRUE)
With lm(), we use poly() instead of pol()…
lmfitA <- lm(cesd ~ pcs, data = help1)
lmfitA_2 <- lm(cesd ~ poly(pcs, 2), data = help1)
lmfitA_3 <- lm(cesd ~ poly(pcs, 3), data = help1)
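One way (not shown in these slides) to decide how much polynomial is worth keeping is a sequential F test across the nested lm() fits above:

```r
# Sequential F tests: linear vs. quadratic vs. cubic fits in pcs
anova(lmfitA, lmfitA_2, lmfitA_3)
```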
Raw vs. Orthogonal Polynomials
Predict cesd using pcs with a “raw polynomial of degree 2.”
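The fits temp1 and temp2 are used below but their fitting code does not appear here; presumably they were created along these lines (a sketch with lm(), chosen to match the term names in the vif() output shown later):

```r
# Hypothetical reconstruction of temp1 and temp2 (not shown on the slide)
temp1 <- lm(cesd ~ pcs + I(pcs^2), data = help1)  # raw quadratic
temp2 <- lm(cesd ~ poly(pcs, 2), data = help1)    # orthogonal quadratic
```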
temp1_aug <- augment(temp1, help1)
temp2_aug <- augment(temp2, help1)
p1 <- ggplot(temp1_aug, aes(x = pcs, y = cesd)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(x = pcs, y = .fitted), col = "red", linewidth = 2) +
  labs(title = "temp1: Raw fit, degree 2")
p2 <- ggplot(temp2_aug, aes(x = pcs, y = cesd)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(x = pcs, y = .fitted), col = "blue", linewidth = 2) +
  labs(title = "temp2: Orthogonal fit, degree 2")
p1 + p2 +
  plot_annotation(title = "Comparing Two Methods of Fitting a Quadratic Polynomial")
The two models are, in fact, identical: they produce the same fitted values, and only the coefficient parameterization differs.
Fits of raw vs orthogonal polynomials
Why use orthogonal polynomials?
The main reason is to avoid having to include powers of our predictor that are highly collinear.
Variance Inflation Factor assesses collinearity…
rms::vif(temp1) ## from rms package
pcs I(pcs^2)
54.66793 54.66793
Orthogonal polynomial terms are uncorrelated…
rms::vif(temp2)
poly(pcs, 2)1 poly(pcs, 2)2
1 1
Why orthogonal polynomials?
An orthogonal polynomial sets up a model design matrix and then scales those columns so that each column is uncorrelated with the others. The tradeoff is that the raw polynomial is a lot easier to explain in terms of a single equation in the simplest case.
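A quick check of that claim, assuming help1 is loaded: the columns that poly() produces are uncorrelated (up to floating-point error).

```r
# Correlation between the two orthogonal polynomial columns for pcs
P <- poly(help1$pcs, 2)
round(cor(P[, 1], P[, 2]), 10)  # essentially zero
```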
Actually, we’ll often use splines instead of polynomials, which are more flexible and require less maintenance, but at the cost of pretty much requiring you to focus on visualizing their predictions rather than their equations.
fitB_3: adding a cubic polynomial in pcs
dd <- datadist(help1); options(datadist = "dd")
fitB_3 <- ols(cesd ~ pol(pcs, 3) + subst + pss_fr + sex,
              data = help1, x = TRUE, y = TRUE)
fitB_3$coefficients
Linear Regression Model
ols(formula = cesd ~ pol(pcs, 3) + subst + pss_fr + sex, data = help1,
x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 81.80 R2 0.165
sigma 11.5236 d.f. 7 R2 adj 0.152
d.f. 445 Pr(> chi2) 0.0000 g 5.808
Residuals
Min 1Q Median 3Q Max
-29.2871 -7.9098 0.9439 7.8319 27.6486
Coef S.E. t Pr(>|t|)
Intercept 5.2983 22.4423 0.24 0.8135
pcs 3.2272 1.5457 2.09 0.0374
pcs^2 -0.0795 0.0346 -2.30 0.0221
pcs^3 0.0006 0.0003 2.30 0.0217
subst=cocaine -3.8581 1.3015 -2.96 0.0032
subst=heroin 0.0455 1.3589 0.03 0.9733
pss_fr -0.5128 0.1371 -3.74 0.0002
sex=male -4.5982 1.3073 -3.52 0.0005
Effect Sizes in fitB_3
ggplot(Predict(fitB_3, conf.int = 0.90))
A Nomogram for fitB_3
plot(nomogram(fitB_3, abbrev = TRUE))
Non-Linear Terms: Splines
Types of Splines
A linear spline is a continuous function formed by connecting points (called knots of the spline) by line segments.
A restricted cubic spline is a way to build highly complicated curves into a regression equation in a fairly easily structured way.
A restricted cubic spline is a series of polynomial functions joined together at the knots.
Such a spline gives us a way to flexibly account for non-linearity without over-fitting the model.
How complex should our spline be?
Restricted cubic splines can fit many different types of non-linearities. Specifying the number of knots is all you need to do in R to get a reasonable result from a restricted cubic spline.
The most common choices are 3, 4, or 5 knots.
3 Knots, 2 degrees of freedom, allows the curve to “bend” once.
4 Knots, 3 degrees of freedom, lets the curve “bend” twice.
5 Knots, 4 degrees of freedom, lets the curve “bend” three times.
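The basis that rcs() builds can be inspected directly with Hmisc::rcspline.eval, which shows where the nk - 2 nonlinear terms and the knots come from. A sketch, with an illustrative grid of values (not the actual pcs data):

```r
# Inspecting a restricted cubic spline basis (assumes Hmisc is installed)
library(Hmisc)
x <- seq(20, 80, by = 1)  # illustrative grid resembling a pcs-like range
basis <- rcspline.eval(x, nk = 4, inclx = TRUE)
dim(basis)                # 3 columns: x itself plus nk - 2 = 2 nonlinear terms
attr(basis, "knots")      # knot locations, chosen from quantiles of x
```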
Restricted Cubic Splines with ols
Let’s consider a restricted cubic spline model for cesd based on pcs with:
3 knots in fitC3, 4 knots in fitC4, and 5 knots in fitC5
dd <- datadist(help1)
options(datadist = "dd")
fitC3 <- ols(cesd ~ rcs(pcs, 3), data = help1, x = TRUE, y = TRUE)
fitC4 <- ols(cesd ~ rcs(pcs, 4), data = help1, x = TRUE, y = TRUE)
fitC5 <- ols(cesd ~ rcs(pcs, 5), data = help1, x = TRUE, y = TRUE)
Model fitC3 (3-knot spline in pcs)
fitC3
Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 3), data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 40.79 R2 0.086
sigma 11.9901 d.f. 2 R2 adj 0.082
d.f. 450 Pr(> chi2) 0.0000 g 4.206
Residuals
Min 1Q Median 3Q Max
-28.3462 -7.7005 0.5098 8.6376 29.8454
Coef S.E. t Pr(>|t|)
Intercept 47.3631 4.7053 10.07 <0.0001
pcs -0.2908 0.1187 -2.45 0.0146
pcs' -0.0624 0.1363 -0.46 0.6471
Effect Sizes in fitC3
ggplot(Predict(fitC3, conf.int = 0.90))
A Nomogram for fitC3
plot(nomogram(fitC3, abbrev = TRUE))
Model fitC4 (4-knot spline in pcs)
fitC4
Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 4), data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 51.31 R2 0.107
sigma 11.8648 d.f. 3 R2 adj 0.101
d.f. 449 Pr(> chi2) 0.0000 g 4.590
Residuals
Min 1Q Median 3Q Max
-28.3147 -8.2830 0.8559 8.8866 26.5458
Coef S.E. t Pr(>|t|)
Intercept 33.3298 6.5742 5.07 <0.0001
pcs 0.1464 0.1856 0.79 0.4308
pcs' -1.4383 0.4497 -3.20 0.0015
pcs'' 6.2561 1.9076 3.28 0.0011
Effect Sizes in fitC4
ggplot(Predict(fitC4, conf.int = 0.90))
A Nomogram for fitC4
plot(nomogram(fitC4, abbrev = TRUE))
Model fitC5 (5-knot spline in pcs)
fitC5
Linear Regression Model
ols(formula = cesd ~ rcs(pcs, 5), data = help1, x = TRUE, y = TRUE)
Model Likelihood Discrimination
Ratio Test Indexes
Obs 453 LR chi2 54.64 R2 0.114
sigma 11.8345 d.f. 4 R2 adj 0.106
d.f. 448 Pr(> chi2) 0.0000 g 4.744
Residuals
Min 1Q Median 3Q Max
-29.396 -7.928 1.016 8.762 26.974
Coef S.E. t Pr(>|t|)
Intercept 39.0631 7.8282 4.99 <0.0001
pcs -0.0436 0.2332 -0.19 0.8517
pcs' -0.2952 1.0079 -0.29 0.7697
pcs'' -3.1835 4.8079 -0.66 0.5082
pcs''' 14.4216 8.3721 1.72 0.0857
Effect Sizes in fitC5
ggplot(Predict(fitC5, conf.int = 0.90))
A Nomogram for fitC5
plot(nomogram(fitC5, abbrev = TRUE))
Fitting fitB including a 5-knot RCS
dd <- datadist(help1)
options(datadist = "dd")
fitB5 <- ols(cesd ~ rcs(pcs, 5) + subst + pss_fr + sex,
             data = help1, x = TRUE, y = TRUE)
fitB5$coefficients
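A natural follow-up (not shown above) is rms's anova(), which partitions the spline's contribution into overall and nonlinear pieces, letting us ask whether the non-linearity in pcs is worth its degrees of freedom:

```r
# Tests the overall and nonlinear contributions of rcs(pcs, 5) in fitB5
anova(fitB5)
```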